AXXXXXXXXXXXXXXXXXXXXXX

1. Welcome to our tutorial about the Quanteda package!

YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

Step 1 Load the packages which we will use (don’t forget to install it before!).

library(readtext)
library(tidyverse)
library(quanteda)
library(rvest)
library(quanteda.textstats)
library(quanteda.textplots)
library(stringr)
library(spacyr)
library(xml2)
library(ggsci)

2. Scraping Wikipedia to get the data

If you don’t want to go through this, you might go to the session “3. Alternatively… import a CSV file!”.

ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

Step 2.1 Scrap XXXXXXXXXX to get the episodes transcriptions.

datalist_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")
paste("CHECK WITH JORGE WHERE DID HE TAKE THE EPISODES TRANSCRIPTS")
## [1] "CHECK WITH JORGE WHERE DID HE TAKE THE EPISODES TRANSCRIPTS"

Step 2.2 Scrap Wikipedia to get the episodes and characters metadata and clean it!

### CHECK WITH JORGE WHERE DID HE TAKE THE EPISODES TRANSCRIPTS

url_himym <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_episodes"

url_himym_characters <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_characters"

l_tables_himym <- url_himym %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)
l_tables_himym
## [[1]]
## # A tibble: 10 × 6
##    Season Episodes Episodes `Originally aired`              Originally a…¹ Rank 
##    <chr>  <chr>    <chr>    <chr>                           <chr>          <chr>
##  1 Season Episodes Episodes First aired                     Last aired     Rank 
##  2 1      22       22       September 19, 2005 (2005-09-19) May 15, 2006 … 54   
##  3 2      22       22       September 18, 2006 (2006-09-18) May 14, 2007 … 61   
##  4 3      20       20       September 24, 2007 (2007-09-24) May 19, 2008 … 70   
##  5 4      24       24       September 22, 2008 (2008-09-22) May 18, 2009 … 49   
##  6 5      24       24       September 21, 2009 (2009-09-21) May 24, 2010 … 42   
##  7 6      24       24       September 20, 2010 (2010-09-20) May 16, 2011 … 48   
##  8 7      24       24       September 19, 2011 (2011-09-19) May 14, 2012 … 45   
##  9 8      24       24       September 24, 2012 (2012-09-24) May 13, 2013 … 42   
## 10 9      24       24       September 23, 2013 (2013-09-23) March 31, 201… 28   
## # … with abbreviated variable name ¹​`Originally aired`
## 
## [[2]]
## # A tibble: 22 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1          1              1 "\"Pilot\"" Pamela… Carter… Septem… 1ALH79  10.94[…
##  2          2              2 "\"Purple … Pamela… Carter… Septem… 1ALH01  10.40[…
##  3          3              3 "\"Sweet T… Pamela… Phil L… Octobe… 1ALH02  10.44[…
##  4          4              4 "\"Return … Pamela… Kourtn… Octobe… 1ALH03  9.84[1…
##  5          5              5 "\"Okay Aw… Pamela… Chris … Octobe… 1ALH04  10.14[…
##  6          6              6 "\"Slutty … Pamela… Brenda… Octobe… 1ALH05  10.89[…
##  7          7              7 "\"Matchma… Pamela… Chris … Novemb… 1ALH07  10.55[…
##  8          8              8 "\"The Due… Pamela… Gloria… Novemb… 1ALH06  10.35[…
##  9          9              9 "\"Belly F… Pamela… Phil L… Novemb… 1ALH09  10.29[…
## 10         10             10 "\"The Pin… Pamela… Carter… Novemb… 1ALH08  12.27[…
## # … with 12 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[3]]
## # A tibble: 22 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1         23              1 "\"Where W… Pamela… Carter… Septem… 2ALH01  10.48[…
##  2         24              2 "\"The Sco… Rob Gr… Chris … Septem… 2ALH02  9.14[3…
##  3         25              3 "\"Brunch\… Pamela… Stephe… Octobe… 2ALH03  9.32[3…
##  4         26              4 "\"Ted Mos… Pamela… Kristi… Octobe… 2ALH04  9.09[3…
##  5         27              5 "\"World's… Pamela… Brenda… Octobe… 2ALH06  9.05[3…
##  6         28              6 "\"Aldrin … Pamela… Jamie … Octobe… 2ALH05  9.59[3…
##  7         29              7 "\"Swarley… Pamela… Greg M… Novemb… 2ALH07  8.22[3…
##  8         30              8 "\"Atlanti… Pamela… Maria … Novemb… 2ALH08  9.33[3…
##  9         31              9 "\"Slap Be… Pamela… Kourtn… Novemb… 2ALH09  8.85[3…
## 10         32             10 "\"Single … Pamela… Kristi… Novemb… 2ALH10  9.85[3…
## # … with 12 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[4]]
## # A tibble: 20 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1         45              1 "\"Wait fo… Pamela… Carter… Septem… 3ALH01  8.12[4…
##  2         46              2 "\"We're N… Pamela… Chris … Octobe… 3ALH02  7.88[5…
##  3         47              3 "\"Third W… Pamela… David … Octobe… 3ALH04  7.96[5…
##  4         48              4 "\"Little … Rob Gr… Kourtn… Octobe… 3ALH03  7.71[5…
##  5         49              5 "\"How I M… Pamela… Gloria… Octobe… 3ALH05  8.50[5…
##  6         50              6 "\"I'm Not… Pamela… Jonath… Octobe… 3ALH07  8.55   
##  7         51              7 "\"Dowiset… Pamela… Brenda… Novemb… 3ALH06  8.77[5…
##  8         52              8 "\"Spoiler… Pamela… Stephe… Novemb… 3ALH08  8.58[5…
##  9         53              9 "\"Slapsgi… Pamela… Matt K… Novemb… 3ALH09  8.06[5…
## 10         54             10 "\"The Yip… Pamela… Jamie … Novemb… 3ALH10  7.91[5…
## 11         55             11 "\"The Pla… Pamela… Carter… Decemb… 3ALH11  8.49   
## 12         56             12 "\"No Tomo… Pamela… Carter… March … 3ALH12  9.73   
## 13         57             13 "\"Ten Ses… Pamela… Chris … March … 3ALH14  10.67[…
## 14         58             14 "\"The Bra… Pamela… Joe Ke… March … 3ALH13  9.50[5…
## 15         59             15 "\"The Cha… Pamela… Carter… April … 3ALH15  7.99[6…
## 16         60             16 "\"Sandcas… Pamela… Kourtn… April … 3ALH16  8.45[6…
## 17         61             17 "\"The Goa… Pamela… Stephe… April … 3ALH17  8.84[6…
## 18         62             18 "\"Rebound… Pamela… Jamie … May 5,… 3ALH18  8.36[6…
## 19         63             19 "\"Everyth… Pamela… Jonath… May 12… 3ALH19  8.93[6…
## 20         64             20 "\"Miracle… Pamela… Carter… May 19… 3ALH20  7.99[6…
## # … with abbreviated variable names ¹​`Directed by`, ²​`Written by`,
## #   ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[5]]
## # A tibble: 24 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1         65              1 "\"Do I Kn… Pamela… Carter… Septem… 4ALH01  9.79[6…
##  2         66              2 "\"The Bes… Pamela… Carter… Septem… 4ALH02  8.72[6…
##  3         67              3 "\"I Heart… Pamela… Greg M… Octobe… 4ALH04  8.97[6…
##  4         68              4 "\"Interve… Michae… Stephe… Octobe… 4ALH03  9.25[6…
##  5         69              5 "\"Shelter… Pamela… Chris … Octobe… 4ALH05  9.45[7…
##  6         70              6 "\"Happily… Pamela… Jamie … Novemb… 4ALH06  9.40[7…
##  7         71              7 "\"Not a F… Pamela… Robia … Novemb… 4ALH07  9.79[7…
##  8         72              8 "\"Woooo!\… Pamela… Carter… Novemb… 4ALH09  9.99[7…
##  9         73              9 "\"The Nak… Pamela… Joe Ke… Novemb… 4ALH08  10.04[…
## 10         74             10 "\"The Fig… Pamela… Theres… Decemb… 4ALH10  10.49[…
## # … with 14 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[6]]
## # A tibble: 24 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1         89              1 "\"Definit… Pamela… Carter… Septem… 5ALH01  9.09[9…
##  2         90              2 "\"Double … Pamela… Matt K… Septem… 5ALH02  8.73[9…
##  3         91              3 "\"Robin 1… Pamela… Carter… Octobe… 5ALH03  8.23[9…
##  4         92              4 "\"The Sex… Pamela… Kourtn… Octobe… 5ALH04  8.56[9…
##  5         93              5 "\"Duel Ci… Pamela… Chuck … Octobe… 5ALH05  8.07[9…
##  6         94              6 "\"Bagpipe… Pamela… Robia … Novemb… 5ALH06  8.82[9…
##  7         95              7 "\"The Rou… Pamela… Chris … Novemb… 5ALH07  8.82[9…
##  8         96              8 "\"The Pla… Pamela… Carter… Novemb… 5ALH08  8.44[9…
##  9         97              9 "\"Slapsgi… Pamela… Jamie … Novemb… 5ALH09  8.75[9…
## 10         98             10 "\"The Win… Pamela… Joe Ke… Decemb… 5ALH11  8.79[9…
## # … with 14 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[7]]
## # A tibble: 24 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1        113              1 "\"Big Day… Pamela… Carter… Septem… 6ALH01  8.79[1…
##  2        114              2 "\"Cleanin… Pamela… Stephe… Septem… 6ALH02  9.00[1…
##  3        115              3 "\"Unfinis… Pamela… Jamie … Octobe… 6ALH03  8.60[1…
##  4        116              4 "\"Subway … Pamela… Chris … Octobe… 6ALH04  8.48[1…
##  5        117              5 "\"Archite… Pamela… Carter… Octobe… 6ALH05  8.05[1…
##  6        118              6 "\"Baby Ta… Pamela… Joe Ke… Octobe… 6ALH07  8.29[1…
##  7        119              7 "\"Canning… Pamela… Chuck … Novemb… 6ALH06  8.88[1…
##  8        120              8 "\"Natural… Pamela… Carter… Novemb… 6ALH09  8.87[1…
##  9        121              9 "\"Glitter… Pamela… Kourtn… Novemb… 6ALH08  8.87[1…
## 10        122             10 "\"Blitzgi… Pamela… Theres… Novemb… 6ALH10  8.73[1…
## # … with 14 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[8]]
## # A tibble: 23 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1        137              1 "\"The Bes… Pamela… Carter… Septem… 7ALH01  11.00[…
##  2        138              2 "\"The Nak… Pamela… Stephe… Septem… 7ALH02  12.22[…
##  3        139              3 "\"Ducky T… Rob Gr… Carter… Septem… 7ALH03  10.50[…
##  4        140              4 "\"The Sti… Pamela… Kourtn… Octobe… 7ALH04  10.39[…
##  5        141              5 "\"Field T… Pamela… Jamie … Octobe… 7ALH06  8.89[1…
##  6        142              6 "\"Mystery… Pamela… Chuck … Octobe… 7ALH05  9.81[1…
##  7        143              7 "\"Noretta… Pamela… Matt K… Octobe… 7ALH07  9.87[1…
##  8        144              8 "\"The Slu… Pamela… Tami S… Octobe… 7ALH08  10.49[…
##  9        145              9 "\"Disaste… Michae… Robia … Novemb… 7ALH09  10.28[…
## 10        146             10 "\"Tick Ti… Pamela… Chris … Novemb… 7ALH10  10.42[…
## # … with 13 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[9]]
## # A tibble: 23 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1        161              1 "\"Farhamp… Pamela… Carter… Septem… 8ALH01  8.84[1…
##  2        162              2 "\"The Pre… Pamela… Carter… Octobe… 8ALH02  8.17[1…
##  3        163              3 "\"Nannies… Pamela… Chuck … Octobe… 8ALH03  7.82[1…
##  4        164              4 "\"Who Wan… Pamela… Matt K… Octobe… 8ALH04  7.93[1…
##  5        165              5 "\"The Aut… Pamela… Kourtn… Novemb… 8ALH06  7.22[1…
##  6        166              6 "\"Splitsv… Pamela… Stephe… Novemb… 8ALH05  7.95[1…
##  7        167              7 "\"The Sta… Pamela… Tami S… Novemb… 8ALH07  7.45[1…
##  8        168              8 "\"Twelve … Pamela… Eric F… Novemb… 8ALH08  8.73[1…
##  9        169              9 "\"Lobster… Pamela… Barbar… Decemb… 8ALH09  8.26[1…
## 10        170             10 "\"The Ove… Pamela… Craig … Decemb… 8ALH10  8.82[1…
## # … with 13 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[10]]
## # A tibble: 23 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1        185              1 "\"The Loc… Pamela… Carter… Septem… 9ALH01  9.40[1…
##  2        186              2 "\"Coming … Pamela… Carter… Septem… 9ALH02  9.40[1…
##  3        187              3 "\"Last Ti… Pamela… Craig … Septem… 9ALH04  7.87[1…
##  4        188              4 "\"The Bro… Pamela… Matt K… Octobe… 9ALH03  7.53[1…
##  5        189              5 "\"The Pok… Pamela… Dan Gr… Octobe… 9ALH05  7.98[1…
##  6        190              6 "\"Knight … Pamela… Chris … Octobe… 9ALH07  7.64[1…
##  7        191              7 "\"No Ques… Pamela… Stephe… Octobe… 9ALH06  7.63[1…
##  8        192              8 "\"The Lig… Pamela… Rachel… Novemb… 9ALH08  8.67[1…
##  9        193              9 "\"Platoni… Pamela… George… Novemb… 9ALH10  8.08[1…
## 10        194             10 "\"Mom and… Pamela… Carter… Novemb… 9ALH09  8.11[1…
## # … with 13 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`
## 
## [[11]]
## # A tibble: 10 × 26
##    Season Season Episo…¹ Episo…² Episo…³ Episo…⁴ Episo…⁵ Episo…⁶ Episo…⁷ Episo…⁸
##    <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 "Seas… Season    1       2       3       4       5       6       7       8   
##  2 ""     1        10.9    10.4    10.4     9.84   10.1    10.9    10.6    10.4 
##  3 ""     2        10.5     9.14    9.32    9.09    9.05    9.59    8.22    9.33
##  4 ""     3         8.12    7.88    7.96    7.71    8.5     8.55    8.77    8.58
##  5 ""     4         9.79    8.72    8.97    9.25    9.45    9.4     9.79    9.99
##  6 ""     5         9.09    8.73    8.23    8.56    8.07    8.82    8.82    8.44
##  7 ""     6         8.79    9       8.6     8.48    8.05    8.29    8.88    8.87
##  8 ""     7        11      12.2    10.5    10.4     8.89    9.81    9.87   10.5 
##  9 ""     8         8.84    8.17    7.82    7.93    7.22    7.95    7.45    8.73
## 10 ""     9         9.4     9.4     7.87    7.53    7.98    7.64    7.63    8.67
## # … with 16 more variables: `Episode number` <dbl>, `Episode number` <dbl>,
## #   `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## #   `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## #   `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## #   `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## #   `Episode number` <dbl>, `Episode number` <dbl>, and abbreviated variable
## #   names ¹​`Episode number`, ²​`Episode number`, ³​`Episode number`, …
## 
## [[12]]
## # A tibble: 4 × 2
##   .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:n…¹ .mw-p…²
##   <chr>                                                                  <chr>  
## 1 Episodes                                                               "Seaso…
## 2 Characters                                                             "Ted M…
## 3 Soundtracks                                                            "\"Let…
## 4 Related articles                                                       "Award…
## # … with abbreviated variable names
## #   ¹​`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteHow I Met Your Mother`,
## #   ²​`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteHow I Met Your Mother`
## 
## [[13]]
## # A tibble: 11 × 2
##    `vteHow I Met Your Mother episodes`  `vteHow I Met Your Mother episodes`     
##    <chr>                                <chr>                                   
##  1 "Seasons: 1\n2\n3\n4\n5\n6\n7\n8\n9" "Seasons: 1\n2\n3\n4\n5\n6\n7\n8\n9"    
##  2 "Season 1"                           "\"Pilot\"\n\"Purple Giraffe\"\n\"Belly…
##  3 "Season 2"                           "\"Where Were We?\"\n\"Ted Mosby: Archi…
##  4 "Season 3"                           "\"Wait for It\"\n\"We're Not from Here…
##  5 "Season 4"                           "\"Do I Know You?\"\n\"The Best Burger …
##  6 "Season 5"                           "\"Definitions\"\n\"Double Date\"\n\"Ro…
##  7 "Season 6"                           "\"Big Days\"\n\"Cleaning House\"\n\"Un…
##  8 "Season 7"                           "\"The Best Man\"\n\"The Naked Truth\"\…
##  9 "Season 8"                           "\"Farhampton\"\n\"The Pre-Nup\"\n\"Nan…
## 10 "Season 9"                           "\"The Locket\"\n\"Coming Back\"\n\"Las…
## 11 "Category"                           "Category"
#This generates a list with all the tables that contain the page. In our case, 
#we want the table from the second element till the 10th. 
l_tables_himym <- l_tables_himym[c(2:10)]



#Data cleaning to obtain clean tables

#Reduce the list in one data frame since all of the tables share the same structure 
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym)) 

#We do the same for the characters of HIMYM
l_tables_himym_characters <- url_himym_characters %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)

df_characters <- as.data.frame(l_tables_himym_characters[[1]]) %>% 
  select(Character)

df_characters_w <- df_characters %>% 
  filter(!stringr::str_starts(Character, "Futu"),
         !(Character %in% c("Character", "Main Characters", "Supporting Characters"))) %>% 
  mutate(name = str_extract(Character,"([^ ]+)"),
         name = replace(name, name == "Dr.", "Sonya"))

df_himym <- data.frame(Reduce(bind_rows, l_tables_himym)) 

df_himym_filt <- df_himym %>% filter(str_length(No.overall) < 4)

df_himym_filt_dupl <- df_himym %>% filter(str_length(No.overall) > 4)

df_himym_filt_dupl_1 <- df_himym_filt_dupl %>% 
  mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 1, 3))),
         No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 1, 2))),
         Prod.code = replace (Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 1, 6)))

df_himym_filt_dupl_2 <- df_himym_filt_dupl %>% 
  mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 4, 6))),
         No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 3, 4))),
         Title = replace(Title, Title == "\"The Magician's Code\"", "\"The Magician's Code Part 2\""),
         Title = replace(Title, Title == "\"The Final Page\"", "\"The Final Page Part 2\""),
         Title = replace(Title, Title == "\"Last Forever\"" , "\"Last Forever Part 2\"" ),
         Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 7, 12)))

df_himym_final <- bind_rows(df_himym_filt, 
                            df_himym_filt_dupl_1, 
                            df_himym_filt_dupl_2) %>% 
  arrange(No.overall, No..inseason) %>% 
  mutate(year = str_extract(Original.air.date, '[0-9]{4}+'),
         Season = as.numeric(stringr::str_extract(Prod.code, "^.{1}"))) %>% 
  rename(Chapter = No..inseason)

df_himym_final$US.viewers.millions. <- as.numeric(str_replace_all(df_himym_final$US.viewers.millions., "\\[[0-9]+\\]", ""))

datalist_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")

v_season <- as.numeric(stringr::str_extract(datalist_himym$doc_id, "\\d+"))

v_chapter <- as.numeric(stringi::stri_extract_last_regex(datalist_himym$doc_id, "[0-9]+"))

datalist_himym_w <- datalist_himym %>% mutate(Season = v_season, Chapter = v_chapter)

df_himym_final_doc <- full_join(as.data.frame(datalist_himym_w), df_himym_final, by = c("Season", "Chapter")) %>% 
  mutate(Season_w = paste("Season", Season))

## Final dataframe to be used as a corpus text

df_himym_final_doc
#write.csv(df_himym_final_doc, file = "df_himym_final_doc.csv")
df_himym_final_doc

3. Alternatively… import the CSV files!

If you went through session 2, you already get the data raw and can jump to session “4. Uploading data into Quanteda corpus”

Step 3.1 Import the csv database

# We need to think about an offline solution
#input <- read.csv("XXXXXXX.csv", stringsAsFactors = F)

#df_himym_final_doc <- 

4. Uploading data into Quanteda corpus

OK! It’s showtime! Let’s upload all our data into a Quanteda corpus element.

Step 4.1 First Step: Define a corpus

corp_himym <- corpus(df_himym_final_doc)  # build a new corpus from the texts

docnames(corp_himym) <- df_himym_final_doc$Title

summary(corp_himym, n = 15)

Step 4.2 Convert corpus into tokens and wrangle it

corp_himym_cleaned <- tokens(corp_himym, remove_punct = TRUE,
                        remove_separators = TRUE, 
                        remove_numbers = TRUE, 
                        remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english"))

Step 4.3 Create a DFM.

dfm_himym_cleaned <- corp_himym_cleaned %>% dfm()

tstat_simil <- textstat_simil(dfm_himym_cleaned)

tstat_dist <- textstat_dist(dfm_himym_cleaned)

Let’s check the new subsetted DFM

dfm_himym_cleaned

5. Now… let’s have fun!

Now that we already upload the data and created the Quanteda elements, we can try some basics analysis.

Step 5.1. Let’s choose a specific season

# Filter corpus and create a subset DFM.

Season_choice <- 1

corpus_subset <- corpus_subset(corp_himym, Season == Season_choice)

tokens_himym_cleaned <- tokens(corpus_subset, remove_punct = TRUE,
                        remove_separators = TRUE, 
                        remove_numbers = TRUE, 
                        remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english"))

dfm_himym_subsetted <- tokens_himym_cleaned %>% dfm()

Let’s check the new subsetted DFM

dfm_himym_subsetted

Step 5.2 Do you think we should watch this show in the original sequence? Let’s check the similarity between episodes!

### Similarity between episodes

clust <- hclust(as.dist(tstat_simil))
dclust <- as.dendrogram(clust)

dclust <- reorder(dclust, 1:22)

nodePar <- list(lab.cex = 1, pch = c(NA, 19), 
                cex.axis=1.5,
                cex = 2, col = "#0080ff")

#Talk about different methods above the correlation 
par(mar=c(15,7,2,1))

plot(dclust,nodePar = nodePar,
     cex.lab=2, cex.axis=2, cex.main=2, cex.sub=2,
     main="How I Met Your Mother Season 1",
     type = "triangle",ylim=c(0,1),
     ylab = "Similarity between episodes (correlation %)",
     edgePar = list(col = 4:7, lwd = 7:7),
     panel.first = abline(h=c(seq(.10, 1, .10)), col="grey80"))

rect.hclust(clust, k = 5, border = "#ffb1b1")

Step 5.3 Is there a correlation between the episodes?

## Distance between episodes (by correlation)

tstat_dist <- textstat_dist(toks_himym_dm_s1)
clust_dist <- hclust(as.dist(tstat_dist))
dclust_dist <- as.dendrogram(clust_dist)

dclust_dist <- reorder(dclust_dist, 22:1)

nodePar_2 <- list(lab.cex = 1.2, pch = c(NA, 19), 
                  cex = 1.8, col = "#ffc733")

par(mar=c(15,7,2,1))

plot(dclust_dist, nodePar = nodePar_2,
     cex.lab=2, cex.axis=2, cex.main=2, cex.sub=2,
     main="How I Met Your Mother Season 1",
     type = "triangle",ylim=c(0,120),
     ylab = "Distance between episodes (correlation %)",
     edgePar = list(col = 18:19, lwd = 7:7),
     panel.first =abline(h=c(seq(10, 120, 10)), col="grey80"))
rect.hclust(clust_dist, k = 5, border = "#d80000")

Step 5.4 What is the main actor for you? Does it depend on the season?

# Appearances of actors by season

## Characters by season

toks_himym <- tokens(corp_himym, remove_punct = TRUE,
                     remove_separators = TRUE, 
                     remove_numbers = TRUE, 
                     remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english"))#Add additional words

df_actors <- toks_himym %>% tokens_select(c("Ted", "Marshall", "Lily", "Robin", "Barney", "Mother")) %>%
  tokens_group(groups = Season) %>% 
  dfm()


df_final_actors <-  as.data.frame(textstat_frequency(df_actors, groups = c(1:9)))

ggplot(df_final_actors, aes(x = group, y = frequency, group = feature, color = feature)) +
  geom_line() +
  geom_point() +
  ylim(0,580)
# Let's try a different aproach!

## Wordcloud of principal characters that appears in HIMYM

toks_himym_characters <- tokens(corp_himym, remove_punct = TRUE,
                                remove_separators = TRUE, 
                                remove_numbers = TRUE, 
                                remove_symbols = TRUE) %>% 
  tokens_keep(c(unique(df_characters_w$name)))

dfm_general_characters <- toks_himym_characters %>%
  dfm()


textplot_wordcloud(dfm_general_characters, 
                   random_order = FALSE, 
                   rotation = 0.25,
                   min_count = 1, #Minimum frequency
                   color = RColorBrewer::brewer.pal(4, "Dark2"))

Step 5.5 What is the most important secondary characters that appears in the TV show?

## Wordcloud of secondary characters that appears in HIMYM 


toks_himym_sec_characters <- tokens(corp_himym, 
                                    remove_punct = TRUE,
                                    remove_separators = TRUE, 
                                    remove_numbers = TRUE, 
                                    remove_symbols = TRUE) %>% 
  tokens_keep(c(unique(df_characters_w$name))) %>% 
  tokens_remove(c("Ted", "Barney", "Lily", "Robin", "Marshall"))

dfm_general_sec_characters <- toks_himym_sec_characters %>%
  dfm()

textplot_wordcloud(dfm_general_sec_characters, 
                   random_order = FALSE, 
                   rotation = 0.25,
                   #comparison = TRUE,
                   labelsize = 1.5,
                   min_count = 1, #Minimum frequency
                   color = RColorBrewer::brewer.pal(8, "Spectral"))

Step 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX`

#YYYYYYYYYYYYYYYY

XXXX Session

ThWWWWWWWWWWWWWWWWWWWW

Step 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX`

#YYYYYYYYYYYYYYYY

Sources

ZZZZZZZZZZZZZZZ